EPILEPTIC SEIZURE RECOGNITION

Data Manipulation

In [1]:
import pandas as pd 
In [2]:
raw = pd.read_csv("epileptic-seizure-recognition.csv")
raw = raw.drop("Unnamed",axis=1)
raw.head()
Out[2]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X170 X171 X172 X173 X174 X175 X176 X177 X178 y
0 135 190 229 223 192 125 55 -9 -33 -38 ... -17 -15 -31 -77 -103 -127 -116 -83 -51 4
1 386 382 356 331 320 315 307 272 244 232 ... 164 150 146 152 157 156 154 143 129 1
2 -32 -39 -47 -37 -32 -36 -57 -73 -85 -94 ... 57 64 48 19 -12 -30 -35 -35 -36 5
3 -105 -101 -96 -92 -89 -95 -102 -100 -87 -79 ... -82 -81 -80 -77 -85 -77 -72 -69 -65 5
4 -9 -65 -98 -102 -78 -48 -16 0 -21 -59 ... 4 2 -12 -32 -41 -65 -83 -89 -73 5

5 rows × 179 columns

In the data, each of the 178 columns is the value of EEG recording for 1 second. People are grouped into 5 different groups based on the inputs coming from the EEG recordings. If we show the groups with a range of numbers from 1 to 5;

  • 1: This group represents epileptic patients who experienced epileptic seizures during the EEG recording.
  • 2-3: These groups represent epileptic patients who did not experience a seizure during the EEG recording. The difference between these groups is that EEG is recorded from different areas of the brain.
  • 4: This group represents healthy participants who did not experience a seizure and their eyes were open during EEG recording.
  • 5: This group represents healthy participants who did not experience a seizure and their eyes were closed during EEG recording.

Since the detection of epileptic seizures is the main focus of this project, these five groups can be divided into two groups based on the epileptic seizure activity. The people who are grouped in classes 2,3,4, and 5 are the ones whose EEG recording does not show any epileptic seizure activity during the recording and those people will be labeled as 0. The people who are grouped in class 1, on the other hand, had some seizure activity during recording and they will be labeled as 1.

In [3]:
y = raw["y"]
e = []
for x in range(0,len(raw)):
    if y[x] in [2,3,4,5]: 
        e.append(0)
    else:
        e.append(1)
In [4]:
raw.append(e)

raw.insert(179,"e",e,True)
raw.head()
Out[4]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X171 X172 X173 X174 X175 X176 X177 X178 y e
0 135 190 229 223 192 125 55 -9 -33 -38 ... -15 -31 -77 -103 -127 -116 -83 -51 4 0
1 386 382 356 331 320 315 307 272 244 232 ... 150 146 152 157 156 154 143 129 1 1
2 -32 -39 -47 -37 -32 -36 -57 -73 -85 -94 ... 64 48 19 -12 -30 -35 -35 -36 5 0
3 -105 -101 -96 -92 -89 -95 -102 -100 -87 -79 ... -81 -80 -77 -85 -77 -72 -69 -65 5 0
4 -9 -65 -98 -102 -78 -48 -16 0 -21 -59 ... 2 -12 -32 -41 -65 -83 -89 -73 5 0

5 rows × 180 columns

Principal Component Analysis for Data Visualization

To get an idea about the distribution of the classes, the data can be visualized after reducing the number of features from 178 to only 2. Principal Component Analysis is one of the most widely used approaches in this dimensionality reduction process and I will use it to visualize the data.

Because the scale of the features is very different, they should be standardized first before starting PCA.

In [5]:
from sklearn.preprocessing import StandardScaler

features = raw.columns[0:178]
X = raw.loc[:, features].values
y = raw.loc[:,"y"]
e = raw.loc[:,"e"]

X = StandardScaler().fit_transform(X)
In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, e_train, e_test = train_test_split(X, e, test_size = 0.25, random_state = 21, stratify = e)
In [7]:
from sklearn.decomposition import PCA 
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook,reset_output,save,show

output_notebook()
def pca_visualization(X,y):
    pca = PCA(n_components = 2)
    pc = pca.fit_transform(X)
    pcDF = pd.DataFrame(data = pc, columns = ['pc1','pc2'])
    pcDF['e'] = y.values
    epilepsy = pcDF.loc[pcDF['e']==1,]
    not_epilepsy = pcDF.loc[pcDF['e']==0,]
    reset_output()
    output_notebook()
    p = figure()
    p.scatter(x = epilepsy['pc1'], 
              y = epilepsy['pc2'], 
              marker="x", 
              fill_alpha=0.5, 
              size=6, 
              legend = "Epilepsy")
    p.scatter(x = not_epilepsy['pc1'], 
              y = not_epilepsy['pc2'], 
              marker="triangle", 
              fill_alpha=0.3, 
              size=6, 
              fill_color="orange",
              line_color="orange",
              legend = "Not Epilepsy")
    p.xaxis.axis_label = "Principal Component 1"
    p.yaxis.axis_label = "Principal Component 2"
    show(p)
Loading BokehJS ...
In [8]:
pca_visualization(X_train,e_train)
Loading BokehJS ...
In [55]:
import numpy as np
rate = np.mean(e_train)
length = len(e_train)

print("There are {}".format(length), "examples in the training set and only {}".format(round(length*rate)), "of them are labeled as Epilepsy.")
There are 8625 examples in the training set and only 1725 of them are labeled as Epilepsy.

As can be seen from the graph, which is obtained after reducing the number of features with PCA, and the number of positive labels, the data is imbalanced. Putting an imbalanced dataset into a machine learning algorithm directly may result in inaccurate predictions since the algorithm may ignore the minority class. I will handle this problem with oversampling.

Over-Sampling with SMOTE

SMOTE is one of the most commonly used oversampling techniques which simply synthesizes new data for the minority classes to obtain an equal number of cases for each label. It is important to note that the algorithm should be used only for training dataset to prevent bias.

In [8]:
from imblearn.over_sampling import SMOTE

oversampling = SMOTE()
X_train_b, e_train_b = oversampling.fit_resample(X_train,e_train)
In [57]:
from collections import Counter

counter_imb = Counter(e_train)
counter_b = Counter(e_train_b)

print(counter_imb)
print(counter_b)
Counter({0: 6900, 1: 1725})
Counter({1: 6900, 0: 6900})
In [58]:
pca_visualization(X_train_b,e_train_b)
Loading BokehJS ...
In [59]:
rate = np.mean(e_train_b)
length = len(e_train_b)

print("There are {}".format(length), "examples in the training set after SMOTE and now {}".format(round(length*rate)), "of them are labeled as Epilepsy.")
There are 13800 examples in the training set after SMOTE and now 6900 of them are labeled as Epilepsy.

Now, the data is balanced. Because there is a high number of features, it may take a long time when the models process the data. A good way to shorten this period is to reduce the number of features by applying feature selection/extraction methods. In this project, I will obtain three different training sets by using three different dimensionality reduction techniques, measure the performance of the classification for three different scenarios, and select the training set that results in the highest performance in the test set.

Principal Component Analysis for Speeding Up the Machine Learning Algorithm

Principal Component Analysis is a feature extraction method that is used to reduce the number of features. It transforms correlated features into linear and independent components.

In this project, I selected the number of principal components as 40 since this number is relatively small and the selected principal components still represent the original data at a good level.

In [9]:
pca = PCA(n_components = 40)
pca.fit(X_train_b)
Out[9]:
PCA(n_components=40)
In [10]:
X_train_reduced_pca = pca.transform(X_train_b)
X_test_reduced_pca = pca.transform(X_test)

To be able to make better comparisons between three different datasets that are obtained after three different dimensionality reduction methods, I wanted these datasets to have the same size. Therefore, I will select the most important 40 features in the other dimensionality reduction methods.

Feature Importance with Extra Tree Classifier

Extra Tree Classifier is an Ensemble Learning method. It is very similar to the Random Forest model. The only difference is that the entire dataset is used to create decision trees and the features and splits are selected at random in this algorithm.

Like in the other Ensemble Learning models, the ones that result in the highest purity/lowest impurity are selected as the most important features in this method.

In [19]:
from sklearn.ensemble import ExtraTreesClassifier

feature_imp = []

for x in range(1,500):
    model = ExtraTreesClassifier()
    model.fit(X_train_b,e_train_b)
    feature_imp =+ model.feature_importances_
In [114]:
from operator import itemgetter
from heapq import nlargest, nsmallest

largest40 = nlargest(40, enumerate(feature_imp), itemgetter(1))
In [115]:
largest40_col_name = []
largest40_values = []
largest40_index = []

for index, value in largest40: 
    name = 's'+ str(index+1)
    largest40_col_name.append(name)
    largest40_values.append(value)
    largest40_index.append(index)
In [116]:
output_notebook()

p = figure(y_range = largest40_col_name, title = "Most Important 40 Features", x_axis_label = "Importance", tools = "")
p.hbar(y = largest40_col_name, right = largest40_values, left = 0, height = 0.5, color = 'orange', fill_alpha = 0.5)
show(p)
Loading BokehJS ...
In [117]:
X_train_f_reduced = X_train_b[:,largest40_index]
X_test_f_reduced = X_test[:,largest40_index]

Sequential Forward Feature Selection

Sequential Forward Feature Selection is another common feature selection method that is used to filter the most important features. The algorithm detects the feature that has the highest correlation with the target variable first. In the next step, pairs of features are formed by using the selected feature and one of the remaining features, and the best pair is selected. Then the triplets are formed by using the selected features and one of the remaining features and the best triplet is selected. This goes until reaching a predefined point. The disadvantage of this algorithm is its slow running time.

In [134]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

mdl = KNeighborsClassifier(n_neighbors = 1)

sequential_forward = sfs(mdl,
                         k_features=40,
                         forward=True,
                         floating=False,
                         verbose=2,
                         scoring='accuracy',
                         cv=10)
In [135]:
sequential_forward_forward = step_forward.fit(X_train_b, e_train_b)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 178 out of 178 | elapsed:  1.5min finished

[2020-06-19 00:25:25] Features: 1/40 -- score: 0.87268115942029[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 177 out of 177 | elapsed:  1.5min finished

[2020-06-19 00:26:55] Features: 2/40 -- score: 0.794927536231884[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 176 out of 176 | elapsed:  1.6min finished

[2020-06-19 00:28:32] Features: 3/40 -- score: 0.8552173913043479[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 175 out of 175 | elapsed:  1.8min finished

[2020-06-19 00:30:23] Features: 4/40 -- score: 0.8965942028985507[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 174 out of 174 | elapsed:  2.0min finished

[2020-06-19 00:32:24] Features: 5/40 -- score: 0.9214492753623189[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 173 out of 173 | elapsed:  2.4min finished

[2020-06-19 00:34:48] Features: 6/40 -- score: 0.94268115942029[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 172 out of 172 | elapsed:  2.8min finished

[2020-06-19 00:37:36] Features: 7/40 -- score: 0.9551449275362319[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 171 out of 171 | elapsed:  3.4min finished

[2020-06-19 00:40:58] Features: 8/40 -- score: 0.9643478260869566[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 170 out of 170 | elapsed:  4.1min finished

[2020-06-19 00:45:01] Features: 9/40 -- score: 0.9719565217391304[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 169 out of 169 | elapsed:  4.8min finished

[2020-06-19 00:49:52] Features: 10/40 -- score: 0.9776086956521739[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 168 out of 168 | elapsed:  5.8min finished

[2020-06-19 00:55:38] Features: 11/40 -- score: 0.9819565217391304[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 167 out of 167 | elapsed:  6.7min finished

[2020-06-19 01:02:17] Features: 12/40 -- score: 0.9856521739130434[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 166 out of 166 | elapsed:  7.7min finished

[2020-06-19 01:09:57] Features: 13/40 -- score: 0.9876086956521739[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 165 out of 165 | elapsed:  8.6min finished

[2020-06-19 01:18:33] Features: 14/40 -- score: 0.9892028985507247[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 164 out of 164 | elapsed:  9.5min finished

[2020-06-19 01:28:06] Features: 15/40 -- score: 0.9906521739130436[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 163 out of 163 | elapsed: 10.3min finished

[2020-06-19 01:38:23] Features: 16/40 -- score: 0.991521739130435[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 162 out of 162 | elapsed: 11.2min finished

[2020-06-19 01:49:33] Features: 17/40 -- score: 0.9921739130434784[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 161 out of 161 | elapsed: 12.0min finished

[2020-06-19 02:01:32] Features: 18/40 -- score: 0.9931159420289856[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 160 out of 160 | elapsed: 12.7min finished

[2020-06-19 02:14:17] Features: 19/40 -- score: 0.9936956521739131[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 159 out of 159 | elapsed: 13.4min finished

[2020-06-19 02:27:43] Features: 20/40 -- score: 0.9941304347826087[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 158 out of 158 | elapsed: 14.5min finished

[2020-06-19 02:42:13] Features: 21/40 -- score: 0.9947101449275364[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.9s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 157 out of 157 | elapsed: 15.5min finished

[2020-06-19 02:57:43] Features: 22/40 -- score: 0.9953623188405798[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 156 out of 156 | elapsed: 16.5min finished

[2020-06-19 03:14:12] Features: 23/40 -- score: 0.9956521739130435[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 155 out of 155 | elapsed: 17.4min finished

[2020-06-19 03:31:37] Features: 24/40 -- score: 0.9958695652173913[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 154 out of 154 | elapsed: 18.1min finished

[2020-06-19 03:49:45] Features: 25/40 -- score: 0.9961594202898552[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 153 out of 153 | elapsed: 19.1min finished

[2020-06-19 04:08:53] Features: 26/40 -- score: 0.9960869565217392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 152 out of 152 | elapsed: 20.2min finished

[2020-06-19 04:29:05] Features: 27/40 -- score: 0.9963768115942029[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 151 out of 151 | elapsed: 21.0min finished

[2020-06-19 04:50:03] Features: 28/40 -- score: 0.9965217391304348[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 22.1min finished

[2020-06-19 05:12:09] Features: 29/40 -- score: 0.9966666666666667[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 149 out of 149 | elapsed: 23.0min finished

[2020-06-19 05:35:08] Features: 30/40 -- score: 0.9968115942028986[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 148 out of 148 | elapsed: 23.8min finished

[2020-06-19 05:58:56] Features: 31/40 -- score: 0.9968840579710145[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 147 out of 147 | elapsed: 24.7min finished

[2020-06-19 06:23:39] Features: 32/40 -- score: 0.9968115942028986[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 146 out of 146 | elapsed: 25.6min finished

[2020-06-19 06:49:16] Features: 33/40 -- score: 0.9968115942028986[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 145 out of 145 | elapsed: 26.1min finished

[2020-06-19 07:15:23] Features: 34/40 -- score: 0.9968115942028986[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 144 out of 144 | elapsed: 26.9min finished

[2020-06-19 07:42:15] Features: 35/40 -- score: 0.9968115942028986[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 143 out of 143 | elapsed: 27.4min finished

[2020-06-19 08:09:38] Features: 36/40 -- score: 0.9969565217391304[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 142 out of 142 | elapsed: 28.2min finished

[2020-06-19 08:37:51] Features: 37/40 -- score: 0.9970289855072464[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 141 out of 141 | elapsed: 29.9min finished

[2020-06-19 09:07:44] Features: 38/40 -- score: 0.9969565217391304[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 140 out of 140 | elapsed: 33.7min finished

[2020-06-19 09:41:26] Features: 39/40 -- score: 0.9971739130434782[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 139 out of 139 | elapsed: 33.8min finished

[2020-06-19 10:15:16] Features: 40/40 -- score: 0.9972463768115942
In [137]:
feat_cols = list(sequential_forward.k_feature_idx_)
print(feat_cols)
[3, 9, 13, 16, 22, 23, 27, 28, 31, 34, 37, 38, 39, 42, 43, 47, 48, 53, 59, 64, 69, 73, 76, 83, 85, 90, 95, 102, 111, 125, 131, 141, 142, 147, 152, 155, 159, 164, 174, 177]
In [144]:
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10,10)
In [146]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

fig = plot_sfs(sequential_forward.get_metric_dict(), kind='std_dev')

plt.ylim([0.75, 1])
plt.title('Sequential Forward Selection')
plt.grid()
plt.show()

As can be seen from the graph, the selected 40 features result in the high performance in the validation set. (Because we do parameter selection, the performance is measured in the validation set). This means that adding more features will not affect the performance of the model. It will only increase the training time.

In [138]:
X_train_sf_reduced = X_train_b[:,feat_cols]
X_test_sf_reduced = X_test[:,feat_cols]

At the end of these dimensionality reduction methods, I obtained three different training sets that have the same number of samples and features. In the next step, I will put those datasets into the exactly same machine learning algorithm, K-Nearest Neighbors with k = 1, make classification on the same dataset (test set) and make comparisons among the dimensionality reduction methods.

K-Nearest Neighbors after Feature Extraction with Principal Component Analysis

In [45]:
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

def model_comparison(X_train, y_train, X_test, y_test, model, parameters, cv, scoring):
    model_cv = GridSearchCV(model, parameters, cv = cv, scoring = scoring)
    model_cv.fit(X_train,y_train)
    prediction = model_cv.predict(X_test)
    prediction_prob = model_cv.predict_proba(X_test)[:,1]
    print("Classification report:\n\n",classification_report(prediction,y_test))
    plot_confusion_matrix(model_cv, X_test, y_test,
                          display_labels = ['Not Epilepsy','Epilepsy'],
                          cmap = plt.cm.Blues, 
                          values_format = ".2f")
    return prediction, prediction_prob, model_cv

K-Nearest Neighbors is one of the simplest and most commonly used algorithms in machine learning. It takes the k closest samples into account when predicting the unknown label of a sample. As we decrease the k value, the algorithm becomes more complex and tends to take noise into account and this may result in overfitting while increasing k value may result in underfitting.

In general, I will prefer to use the F1 score during hyperparameter optimization to be able to obtain higher precision and recall values in the test set.

In [433]:
from sklearn.neighbors import KNeighborsClassifier
param_knn = {'n_neighbors':np.arange(1,5)}

knn_prediction_pca, knn_prediction_prob_pca, knn_pca = model_comparison(X_train = X_train_reduced_pca,
                                                                        y_train = e_train_b, 
                                                                        X_test = X_test_reduced_pca,
                                                                        y_test = e_test, 
                                                                        model = KNeighborsClassifier(),
                                                                        parameters = param_knn,
                                                                        cv=10, scoring= 'f1')
Classification report:

               precision    recall  f1-score   support

           0       1.00      0.96      0.98      2388
           1       0.83      0.98      0.90       487

    accuracy                           0.96      2875
   macro avg       0.91      0.97      0.94      2875
weighted avg       0.97      0.96      0.96      2875

K-Nearest Neighbors after Feature Selection with Extra Trees Classifier

In [122]:
knn_prediction_f, knn_prediction_prob_f, knn_f = model_comparison(X_train = X_train_f_reduced,
                                                                  y_train = e_train_b, 
                                                                  X_test = X_test_f_reduced,
                                                                  y_test = e_test, 
                                                                  model = KNeighborsClassifier(),
                                                                  parameters = param_knn,
                                                                  cv=10, scoring= 'f1')
Classification report:

               precision    recall  f1-score   support

           0       0.99      0.96      0.97      2377
           1       0.83      0.96      0.89       498

    accuracy                           0.96      2875
   macro avg       0.91      0.96      0.93      2875
weighted avg       0.96      0.96      0.96      2875

K-Nearest Neighbors after Feature Selection with Sequential Forward Feature Selection

In [139]:
param_knn = {'n_neighbors':np.arange(1,5)}
knn_prediction_sf, knn_prediction_prob_sf, knn_sf = model_comparison(X_train = X_train_sf_reduced,
                                                                     y_train = e_train_b, 
                                                                     X_test = X_test_sf_reduced,
                                                                     y_test = e_test, 
                                                                     model = KNeighborsClassifier(),
                                                                     parameters = param_knn,
                                                                     cv=10, scoring= 'f1')
Classification report:

               precision    recall  f1-score   support

           0       0.99      0.96      0.98      2376
           1       0.84      0.97      0.90       499

    accuracy                           0.96      2875
   macro avg       0.92      0.96      0.94      2875
weighted avg       0.97      0.96      0.96      2875

When we make a comparison between the performance of the three different classifications, it looks like all of 3 dimensionality reduction techniques result in very similar classifications. The differences between the predictions are most likely due to the randomness. Since the running time is lowest in PCA, I will use the data which it's dimension is reduced with PCA for the rest of the project.

Decision Tree

Decision Tree is one of the simple and interpretable algorithms in machine learning. The algorithm asks questions and makes classifications based on the answers to these questions. There are many hyperparameters used in this algorithm. However, I chose the most common ones to make hyperparameter optimization.

  • max_depth: The maximum depth of the decision tree. Increasing this parameter results in a larger and more complex decision tree and this may result in high performance in the data which is used to create the decision tree and low performance in the new data set.
  • max_features: The number of features considered for the best split. As I see, increasing this parameter results in better performance in general. However, this also reduces training time.
  • min_samples_split: The minimum number of samples required to split an internal node. Increasing this parameter causes the decision tree to be more constrained and smaller. This may result in underfitting.
  • max_samples_leaf: The minimum number of samples required to split a leaf node. Like in the previous scenario, increasing these parameters causes the decision tree to be more constrained and smaller. We may observe underfitting at the end.
In [46]:
from sklearn.tree import DecisionTreeClassifier

param_dtc = {'max_depth':         [200,250,300,1000],
             'max_features':      [25,30,35],
             'min_samples_split': [5,10,20,30,40,50],
             'min_samples_leaf':  [25,50,150,250,500,1000]}

dtc_prediction_pca, dtc_prediction_prob_pca, dtc_pca = model_comparison(X_train = X_train_reduced_pca,
                                                                        y_train = e_train_b, 
                                                                        X_test  = X_test_reduced_pca,
                                                                        y_test  = e_test, 
                                                                        model   = DecisionTreeClassifier(),
                                                                        parameters = param_dtc,
                                                                        cv = 10, 
                                                                        scoring = 'f1')
Classification report:

               precision    recall  f1-score   support

           0       0.91      0.98      0.95      2132
           1       0.94      0.73      0.82       743

    accuracy                           0.92      2875
   macro avg       0.93      0.86      0.88      2875
weighted avg       0.92      0.92      0.91      2875

Random Forest

Random Forest is simply the combination of multiple decision trees. A Decision Tree is built using the entire dataset and all the features. In Random Forest, however, samples are taken from the data randomly, and then Decision Trees are created based on these samples until reaching a predefined number of Decision Trees.

The parameters in the Random Forest are similar to the Decision Tree. Because it takes a lot of time to run the Random Forest model on the training data, I only used three parameters to make hyperparameter optimization. The parameters of 'max_features' and 'max_depth' are like the ones used in the Decision Tree.

  • n_estimators: Number of Decision Trees in the model. As I see, increasing this parameter results in better performance in general and slower training time.

Note: After detecting that the recall score of the Random Forest is very high and the precision score is very low with F1 scoring and changing the scoring as precision does not affect the recall score negatively, I used precision for hyperparameter optimization in Random Forest to be able to obtain more precise model.

In [175]:
from sklearn.ensemble import RandomForestClassifier

param_rf = {'n_estimators': [300,500,700],
            'max_features': [3,5,7],
            'max_depth':    [80,100,120]}

rf_prediction_pca, rf_prediction_prob_pca, rf_pca = model_comparison(X_train = X_train_reduced_pca,
                                                                     y_train = e_train_b, 
                                                                     X_test = X_test_reduced_pca,
                                                                     y_test = e_test, 
                                                                     model = RandomForestClassifier(),
                                                                     parameters = param_rf,
                                                                     cv=10, scoring = 'precision')
Classification report:

               precision    recall  f1-score   support

           0       0.96      1.00      0.98      2213
           1       0.99      0.86      0.92       662

    accuracy                           0.97      2875
   macro avg       0.98      0.93      0.95      2875
weighted avg       0.97      0.97      0.97      2875

Neural Network

Neural Network is a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. Neural networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes' which contain an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates to one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is output. [1]

Like in the machine learning algorithms, the main purpose here is to minimize the difference between the predicted value and actual value. After building a Neural Network, we can show the difference between predicted output and actual value with a function. Many algorithms are used to find the parameters of this function that result in the minimum difference between the predicted value and actual value. We also call these algorithms as optimizers.

In this project, I preferred to use Adam optimizer which can also be seen as the combination of AdaGrad and RMSProp optimizers.

In [366]:
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from sklearn import metrics
from sklearn.model_selection import KFold
from keras.layers import Activation
from keras.optimizers import Adam

X_reduced_pca = np.concatenate((X_train_reduced_pca,X_test_reduced_pca))
e_reduced_pca = np.concatenate((e_train_b, e_test))

kf_cv = KFold(10, shuffle=True, random_state = 42)

oos_y = []
oos_pred = []
fold = 0

def tuning_parameters(learning_rate, epochs):
    for train, test in kf_cv.split(X_reduced_pca):       
        x_train = X_reduced_pca[train]
        y_train = e_reduced_pca[train]
        x_test = X_reduced_pca[test]
        y_test = e_reduced_pca[test]
    
        nn = Sequential()
        nn.add(Dense(100, input_dim = 40, activation = 'relu'))
        nn.add(Dense(100, activation = 'relu'))
        nn.add(Dense(1, activation = 'sigmoid'))
        opt = Adam(lr = learning_rate)
        nn.compile(loss='binary_crossentropy', optimizer=opt, metrics=tf.keras.metrics.RecallAtPrecision(precision=0.9))
    
        nn.fit(x_train,y_train,validation_data=(x_test,y_test), epochs = epochs, verbose = 0)
    
        y_pred = nn.predict_classes(x_test)
    
        oos_y.append(y_test)
        oos_y_concat = np.concatenate(oos_y)
    
        oos_pred.append(y_pred)    
        oos_pred_concat = np.concatenate(oos_pred)
    
    f1_score = metrics.f1_score(oos_pred_concat,oos_y_concat)
    rec = metrics.recall_score(oos_pred_concat,oos_y_concat)
    prec = metrics.precision_score(oos_pred_concat,oos_y_concat)
    return f1_score, rec, prec
In [367]:
learning_rates = [0.0001,0.001, 0.01, 0.1, 1]

f1 = []
rec = []
prec = []

for i in range(len(learning_rates)): 
    x,y,z = tuning_parameters(learning_rate = learning_rates[i], epochs = 100)
    f1.append(x)
    rec.append(y)
    prec.append(z)
In [416]:
nn_pca = Sequential()
nn_pca.add(Dense(1200, input_dim = 40, activation = 'relu'))
nn_pca.add(Dense(1200, activation = 'relu'))
nn_pca.add(Dense(1200, activation = 'relu'))
nn_pca.add(Dense(1, activation = 'sigmoid'))
opt = Adam(lr = 0.001)
nn_pca.compile(loss='binary_crossentropy', optimizer=opt, metrics=tf.keras.metrics.Recall())
nn_pca.fit(X_train_reduced_pca,e_train_b,validation_data=(X_test_reduced_pca,e_test), epochs = 30, verbose = 0)
Out[416]:
<tensorflow.python.keras.callbacks.History at 0x278934f4648>
In [533]:
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix as plot_cm

nn_prediction_prob_pca = nn_pca.predict(X_test_reduced_pca)
nn_prediction_prob_pca_cm = (nn_prediction_prob_pca > 0.5)

nn_prediction_pca = nn_pca.predict_classes(X_test_reduced_pca)

cm = confusion_matrix(e_test, nn_prediction_prob_pca_cm)

class_names = ['Not Epilepsy', 'Epilepsy']

fig,ax  = plot_cm(conf_mat=cm,
                  colorbar=True,
                  class_names = class_names,
                  figsize = (6,6))

Model Selection

After applying a range of algorithms, the next step is to compare them among each other from different perspectives and select the one to make a prediction. Because the data we used to make a prediction is imbalanced, I preferred to choose precision and recall scores as one of my criteria in model selection.

In [64]:
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve, f1_score
import matplotlib.pyplot as plt

precision_knn, recall_knn, _ = precision_recall_curve(e_test,knn_prediction_prob_pca)
precision_dtc, recall_dtc, _ = precision_recall_curve(e_test,dtc_prediction_prob_pca)
precision_rf, recall_rf, _ = precision_recall_curve(e_test,rf_prediction_prob_pca)
precision_nn, recall_nn, _ = precision_recall_curve(e_test,nn_prediction_prob_pca)

f1_knn = f1_score(e_test, knn_prediction_pca)
f1_dtc = f1_score(e_test, dtc_prediction_pca)
f1_rf = f1_score(e_test, rf_prediction_pca)
f1_nn = f1_score(e_test, nn_prediction_pca)

fig = plt.figure(figsize= (10,10))
ax = fig.add_subplot(1,1,1)
ax.plot(recall_knn, precision_knn, label = "K-Nearest Neighbors (F1 = {:.5f})".format(f1_knn), c = 'gray')
ax.plot(recall_dtc, precision_dtc, label = "Decision Tree (F1 = {:.5f})".format(f1_dtc), c = 'orange')
ax.plot(recall_rf, precision_rf, label = "Random Forest (F1 = {:.5f})".format(f1_rf), c = 'purple')
ax.plot(recall_nn, precision_nn, label = "Neural Network (F1 = {:.5f})".format(f1_nn), c = 'green')

ax.grid()
ax.legend()
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title("Precision Recall Curve for Different Models")
plt.show()
In [65]:
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score, precision_score
def specificity_score(e_act, e_pred, threshold): 
    return sum((e_pred < threshold) & (e_act == 0))/sum(e_act == 0)

def report(e_act, e_pred, threshold): 
    precision = precision_score(e_act, (e_pred>threshold))
    recall = recall_score(e_test, (e_pred>threshold))
    specificity = specificity_score(e_act, e_pred, threshold)
    accuracy = accuracy_score(e_act, (e_pred>threshold))
    print('Precision:%.5f'%precision)   
    print('Recall:%.5f'%recall)
    print('Specificity:%.5f'%specificity)
    print('Accuracy:%.5f'%accuracy)
    return precision, recall, specificity, accuracy
In [66]:
print("The performance of the models on the test set are summarised in the table below:\n")    
print("\033[1m" + "K-Nearest Neighbors"+ '\033[0m')
precision_knn, recall_knn, specificity_knn, accuracy_knn = report(e_test, knn_prediction_prob_pca, threshold = 0.5)   
print("\n")
print("\033[1m" + "Decision Tree"+ '\033[0m')
precision_dtc, recall_dtc, specificity_dtc, accuracy_dtc = report(e_test, dtc_prediction_prob_pca, threshold = 0.5)   
print("\n")
print("\033[1m" + "Random Forest"+ '\033[0m')
precision_rf, recall_rf, specificity_rf, accuracy_rf = report(e_test, rf_prediction_prob_pca, threshold = 0.5)   
print("\n")
print("\033[1m" + "Neural Network"+ '\033[0m')
precision_nn, recall_nn, specificity_nn, accuracy_nn = report(e_test, nn_prediction_prob_pca.flatten(), threshold = 0.5)   
The performance of the models on the test set are summarised in the table below:

K-Nearest Neighbors
Precision:0.98357
Recall:0.83304
Specificity:0.99652
Accuracy:0.96383


Decision Tree
Precision:0.76257
Recall:0.94957
Specificity:0.92609
Accuracy:0.93078


Random Forest
Precision:0.86254
Recall:0.99304
Specificity:0.96043
Accuracy:0.96696


Neural Network
Precision:0.90686
Recall:0.96522
Specificity:0.97522
Accuracy:0.97322
In [67]:
performance = tuple([precision_knn,precision_dtc,precision_rf,precision_nn,
                     recall_knn,recall_dtc,recall_rf,recall_nn,
                     specificity_knn ,specificity_dtc ,specificity_rf,specificity_nn,
                     accuracy_knn , accuracy_dtc , accuracy_rf, accuracy_nn])

metrics = np.concatenate([["Precision"],
                          ["Recall"],
                          ["Specificity"],
                          ["Accuracy"]])

models = ["K-Nearest Neighbors", "Decision Tree","Random Forest", "Neural Network"]

stacked = [(metric,model) for metric in metrics for model in models]
In [68]:
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure
from bokeh.palettes import Viridis3
from bokeh.models import LabelSet
from bokeh.transform import factor_cmap

reset_output()
output_notebook()

x = stacked
y = np.round(np.array(performance)*100,2)

source = ColumnDataSource(data=dict(x=x, y=y))

p = figure(x_range=FactorRange(*x), 
           plot_height=400, 
           plot_width = 1200, 
           title="Model Comparison from Different Perspectives",
           toolbar_location=None, 
           tools="")

p.vbar(x='x', 
       top='y', 
       width=0.9, 
       source=source,
       fill_color = factor_cmap('x', 
                                 palette = Viridis3, 
                                 factors = models, 
                                 start = 1, 
                                 end = 3))

labels = LabelSet(x='x', 
                  y='y', 
                  text='y', 
                  level='glyph',
                  x_offset=-13, 
                  y_offset=0, 
                  source=source, 
                  text_font_size="11px",
                  render_mode='canvas')

p.add_layout(labels)
p.title.text_font_size = "12px"
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.yaxis.axis_label = 'Performance(%)'
p.xaxis.major_label_orientation = 1
show(p)
Loading BokehJS ...
C:\Users\ozyur\anaconda3\lib\site-packages\bokeh\models\mappers.py:140: UserWarning: Palette length does not match number of factors. ['Neural Network'] will be assigned to `nan_color` gray
  warnings.warn("Palette length does not match number of factors. %s will be assigned to `nan_color` %s" % (extra_factors, self.nan_color))

When we look at the graph above, we see that the Random Forest model predicted 99.3% of the cases with epileptic seizure and 96.04% of seizure-free cases correctly. However, it is also important to note that 13.75% of the 'epilepsy' predictions are incorrect with this model. In other words, the precision score of the model is relatively low.

The K-Nearest Neighbors model, on the other hand, is more precise. 98.36% of the 'epilepsy' predictions are correct. Besides this, 99.65% of the seizure-free cases are predicted correctly with this model. When we look at the recall score, on the other hand, we see that the model predicted only 83.3% of the cases with epileptic seizure are predicted correctly and this is not at the desired level at all.

In general, the main factors that are considered when selecting a model can be summarized as below:

  • The performance of the model.
  • The interpretability of the model.
  • The complexity of the model.
  • Duration of building, training, and the model.
  • Duration of making a prediction.
  • Whether the model meets the business goal or not.

Because predicting people who have epilepsy correctly is the most important criterion in general, I selected the Random Forest model to make predictions even though it is a relatively noninterpretable and complex model.

Analysis of Incorrect Predictions

After making predictions on a new dataset with the Random Forest model, the next step can be to analyze these predictions. image.png

As can be seen from the confusion matrix, 91 cases are predicted as having epilepsy incorrectly even though there was no sign of seizure during the EEG recording in those cases. In other words, there are 91 false positives. This is a relatively high number. To be able to better understand why these cases are predicted as having epilepsy, I will compare the distributions of false positives with the distributions of true negatives from two different perspectives.

In [16]:
incorrect_ind = rf_pca.predict(X_test_reduced_pca)!=e_test
correct_ind = rf_pca.predict(X_test_reduced_pca)==e_test
In [17]:
incorrect_df = pd.DataFrame(X_test[incorrect_ind])
incorrect_df["e"] = e_test.values[incorrect_ind]
In [18]:
correct_df = pd.DataFrame(X_test[correct_ind])
correct_df["e"] = e_test.values[correct_ind]
In [19]:
fp = incorrect_df[incorrect_df["e"]==0]
fn = incorrect_df[incorrect_df["e"]==1]
In [20]:
tn = correct_df[correct_df["e"]==0]
tp = correct_df[correct_df["e"]==1]
In [31]:
tn.T.boxplot(figsize = (30,300),vert=False)
plt.xlim(-7.75,13)
plt.show()

The plot above is the distribution of the true negative samples.

In [30]:
fp.T.boxplot(figsize = (30,13),vert=False)
plt.xlim(-7.75,13)
plt.show()

The plot above is the distribution of the false positive samples.

There is a clear difference between the distribution of false positives and true negatives. There are many more outliers in the distribution of false positive samples. Also, IQR (interquartile range) is larger for the cases that are predicted as having epilepsy even though there was no sign of epileptic seizures during the EEG recording.

In [34]:
tn.boxplot(figsize = (30,30),vert=False)
plt.xlim(-7.85,13.90)
plt.show()

The plot above is the distribution of features of the true negative samples.

In [33]:
fp.boxplot(figsize = (30,30),vert=False)
plt.xlim(-7.85,13.90)
plt.show()

The plot above is the distribution of features of the false positive samples.

Like in the distribution of samples, there is also a clear difference between the distribution of features of false positives and true negatives. There are many outliers and IQR is larger in the features of false positives.

Because I don't know reading EEG recordings, and don't have a domain knowledge, I didn't want to remove outliers in the data preprocessing phase at the beginning of the project. Because an outlier could also be a sign of an epileptic seizure and I didn't want to ignore them.

However, after seeing how different the distributions of false positives and true negatives are, we can handle the samples and features by taking into outliers and IQR into account before making prediction. If we see that the distribution of a sample/feature fits the distribution of the false positive samples/features of false positives, we can repeat the EEG process or adjust the outliers in the sample/feature.